
    Classification on imbalanced data sets, taking advantage of errors to improve performance

    Classification methods usually exhibit poor performance when applied to imbalanced data sets. To overcome this problem, several algorithms have been proposed in the last decade. Most of them generate synthetic instances to balance the data set, regardless of the classification algorithm. These methods work reasonably well in most cases; however, they tend to cause over-fitting. In this paper, we propose a method to address the imbalance problem. Our approach, which is very simple to implement, works in two phases: the first detects instances that are difficult for classification methods to predict correctly. These instances are then categorized as “noisy” or “secure”, where the former refers to instances most of whose nearest neighbors belong to the opposite class. The second phase generates a number of synthetic instances for each instance that is difficult to predict correctly. After applying our method to data sets, the AUC of classifiers is improved dramatically. We compare our method with state-of-the-art alternatives on more than 10 data sets.
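
The two-phase scheme described above can be sketched roughly as follows. All names are hypothetical, and both the "noisy" threshold and the SMOTE-style interpolation in phase two are assumptions about details the abstract leaves open:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def oversample_difficult(X, y, minority=1, k=5, n_copies=2, rng=None):
    """Phase 1: flag minority instances that are hard to predict (most of
    their k nearest neighbours belong to the opposite class).  Phase 2:
    generate synthetic points near each flagged instance (SMOTE-style
    interpolation is assumed here)."""
    rng = np.random.default_rng(rng)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)            # idx[:, 0] is the point itself
    new_X, new_y = [], []
    for i in np.where(y == minority)[0]:
        neigh = idx[i, 1:]
        # "noisy": the majority of neighbours are in the opposite class
        if np.mean(y[neigh] != minority) > 0.5:
            same = neigh[y[neigh] == minority]
            for _ in range(n_copies):
                # interpolate toward a same-class neighbour if one exists,
                # otherwise duplicate the instance itself
                target = X[rng.choice(same)] if len(same) else X[i]
                lam = rng.random()
                new_X.append(X[i] + lam * (target - X[i]))
                new_y.append(minority)
    if new_X:
        return np.vstack([X, new_X]), np.concatenate([y, new_y])
    return X, y
```

The flagged instances are exactly those whose local neighbourhood is dominated by the opposite class, so the synthetic points reinforce the minority class only where the decision boundary is genuinely contested.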

    Continuation for thin film hydrodynamics and related scalar problems

    This chapter illustrates how to apply continuation techniques in the analysis of a particular class of nonlinear kinetic equations that describe the time evolution, through transport equations, of a single scalar field such as a density or an interface profile of various types. We first systematically introduce these equations as gradient dynamics combining mass-conserving and non-mass-conserving fluxes, followed by a discussion of nonvariational amendments and a brief introduction to their analysis by numerical continuation. The approach is first applied to a number of common examples of variational equations, namely Allen-Cahn- and Cahn-Hilliard-type equations, including certain thin-film equations for partially wetting liquids on homogeneous and heterogeneous substrates, as well as Swift-Hohenberg and Phase-Field-Crystal equations. Second, we consider nonvariational examples such as the Kuramoto-Sivashinsky equation, convective Allen-Cahn and Cahn-Hilliard equations, and thin-film equations describing stationary sliding drops and a transversal front instability in dip-coating. Through these examples we illustrate how to employ the numerical tools provided by the packages auto07p and pde2path to determine steady, stationary and time-periodic solutions in one and two dimensions, together with the resulting bifurcation diagrams. The incorporation of boundary conditions and integral side conditions is also discussed, as well as problem-specific implementation issues.
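
The gradient-dynamics form mentioned above combines a mass-conserving and a non-mass-conserving contribution, both driven by the same free-energy functional F[u]. Schematically, in the notation commonly used in the thin-film literature (the symbols here are assumed, not quoted from the chapter):

```latex
\partial_t u \;=\; \nabla \cdot \left[ Q_c(u)\, \nabla \frac{\delta F[u]}{\delta u} \right] \;-\; Q_{nc}(u)\, \frac{\delta F[u]}{\delta u}
```

where Q_c(u) ≥ 0 and Q_nc(u) ≥ 0 are mobility functions. Particular choices of F and the mobilities recover the variational examples (Allen-Cahn corresponds to the nonconserved term only, Cahn-Hilliard and thin-film equations to the conserved term only), while the nonvariational amendments discussed in the chapter break this structure.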

    Illusory Stimuli Can Be Used to Identify Retinal Blind Spots

    Background. Identification of visual field loss in people with retinal disease is not straightforward, as people with eye disease are frequently unaware of substantial deficits in their visual field, as a consequence of perceptual completion ("filling-in") of affected areas. Methodology. We attempted to induce a compelling visual illusion known as the induced twinkle after-effect (TwAE) in eight patients with retinal scotomas. Half of these patients experience filling-in of their scotomas such that they are unaware of the presence of their scotoma, and conventional campimetric techniques cannot be used to identify their vision loss. The region of the TwAE was compared to microperimetry maps of the retinal lesion. Principal Findings. Six of our eight participants experienced the TwAE. This effect occurred in three of the four people who filled-in their scotoma. The boundary of the TwAE showed good agreement with the boundary of the lesion, as determined by microperimetry. Conclusion. For the first time, we have determined vision loss by asking patients to report the presence of an illusory percept in blind areas, rather than the absence of a real stimulus. This illusory technique is quick, accurate and not subject to the effects of filling-in.

    Predictive models for anti-tubercular molecules using machine learning on high-throughput biological screening datasets

    Background. Tuberculosis is a contagious disease caused by Mycobacterium tuberculosis (Mtb), affecting more than two billion people around the globe, and is one of the major causes of morbidity and mortality in the developing world. Recent reports suggest that Mtb has been developing resistance to the widely used anti-tubercular drugs, resulting in the emergence and spread of multi drug-resistant (MDR) and extensively drug-resistant (XDR) strains throughout the world. In view of this global epidemic, there is an urgent need to facilitate fast and efficient lead-identification methodologies. Target-based screening of large compound libraries has been widely used as a fast and efficient approach for lead identification, but is restricted by knowledge about the target structure. Whole-organism screens, on the other hand, are target-agnostic and have now been widely employed as an alternative for lead identification, but they are limited by the time and cost involved in running the screens for large compound libraries. This could possibly be circumvented by using computational approaches to prioritize molecules for screening programmes. Results. We utilized physicochemical properties of compounds to train four supervised classifiers (Naïve Bayes, Random Forest, J48 and SMO) on three publicly available bioassay screens of Mtb inhibitors and validated the robustness of the predictive models using various statistical measures. Conclusions. This study is a comprehensive analysis of high-throughput bioassay data for anti-tubercular activity and the application of machine learning approaches to create target-agnostic predictive models for anti-tubercular agents.
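
The descriptor-based workflow described above can be sketched as follows, with scikit-learn's RandomForestClassifier standing in for the Weka learners named in the abstract, and synthetic descriptors and labels in place of the real bioassay data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical physicochemical descriptor matrix: one row per compound
# (columns might be molecular weight, logP, H-bond donors, etc.).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
# Synthetic "active / inactive" labels tied to two of the descriptors,
# standing in for the bioassay outcome.
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
# Cross-validated AUC is one of the statistical measures one would use
# to validate the robustness of such a predictive model.
scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print(round(scores.mean(), 3))
```

With real data, the descriptor matrix would be computed from compound structures and the labels taken from the primary screen; the validation step stays the same.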

    Properties of Graphene: A Theoretical Perspective

    In this review, we provide an in-depth description of the physics of monolayer and bilayer graphene from a theorist's perspective. We discuss the physical properties of graphene in an external magnetic field, reflecting the chiral nature of the quasiparticles near the Dirac point with a Landau level at zero energy. We address the unique integer quantum Hall effects, the role of electron correlations, and the recent observation of the fractional quantum Hall effect in monolayer graphene. The quantum Hall effect in bilayer graphene is fundamentally different from that of a monolayer, reflecting the unique band structure of this system. The theory of transport in the absence of an external magnetic field is discussed in detail, along with the role of disorder studied in various theoretical models. We highlight the differences and similarities between monolayer and bilayer graphene, and focus on properties such as the compressibility, the plasmon spectra, the weak localization correction, the quantum Hall effect, and optical properties. Confinement of electrons in graphene is nontrivial due to Klein tunneling. We review various theoretical and experimental studies of quantum confined structures made from graphene. The band structure of graphene nanoribbons and the role of the sublattice symmetry, edge geometry and the size of the nanoribbon on the electronic and magnetic properties are very active areas of research, and a detailed review of these topics is presented. Also, the effects of substrate interactions, adsorbed atoms, lattice defects and doping on the band structure of finite-sized graphene systems are discussed. We also include a brief description of graphane -- a gapped material obtained from graphene by attaching hydrogen atoms to each carbon atom in the lattice. (Comment: 189 pages; submitted to Advances in Physics.)

    Improving a gold standard: treating human relevance judgments of MEDLINE document pairs

    Given prior human judgments of the condition of an object, it is possible to use these judgments to make a maximum likelihood estimate of what future human judgments of the condition of that object will be. However, if one has a reasonably large collection of similar objects, and the prior judgments of a number of judges regarding the condition of each object in the collection, then it is possible to make predictions of future human judgments for the whole collection that are superior to the simple maximum likelihood estimate for each object in isolation. This is possible because the multiple judgments over the collection allow an analysis to determine the relative value of a judge compared with the other judges in the group, and this value can be used to augment or diminish a particular judge's influence in predicting future judgments. Here we study and compare five different methods for making such improved predictions and show that each is superior to simple maximum likelihood estimates.
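
One very simple instance of the idea above, sketched with hypothetical names (the five methods the abstract compares are more principled than this agreement-with-majority heuristic):

```python
import numpy as np

def judge_weights(J):
    """J[i, j] = binary judgment of judge j on object i.  A judge's
    weight is taken here as their agreement rate with the unweighted
    majority vote -- one crude way to estimate a judge's relative value
    within the group."""
    majority = (J.mean(axis=1) > 0.5).astype(int)
    return (J == majority[:, None]).mean(axis=0)

def weighted_prediction(J):
    """Predict future judgments by a vote in which each judge's
    influence is scaled by their estimated reliability."""
    w = judge_weights(J)
    score = J @ w / w.sum()      # weighted fraction of positive votes
    return (score > 0.5).astype(int)
```

A judge who systematically disagrees with the group receives a low weight, so their votes are diminished in the collection-wide prediction, which is exactly the mechanism the abstract credits for beating per-object estimates.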

    Identification and characterization of seed-specific transcription factors regulating anthocyanin biosynthesis in black rice

    Black rice is rich in anthocyanin and is expected to have more healthful dietary potential than white rice. We assessed expression of anthocyanin in black rice cultivars using a newly designed 135 K Oryza sativa microarray. A total of 12,673 genes exhibited greater than 2.0-fold up- or down-regulation in comparisons between three rice cultivars and three seed developmental stages. The 137 transcription factor genes found to be associated with production of anthocyanin pigment were classified into 10 groups. In addition, 17 unknown and hypothetical genes were identified from comparisons between the rice cultivars. Finally, 15 of the 17 candidate genes were verified by RT-PCR analysis. Among these, nine were up-regulated and six were down-regulated. These genes likely either play a regulatory role in anthocyanin biosynthesis or are related to anthocyanin metabolism during flavonoid biosynthesis. While these genes require further validation, the results here underline the potential use of the new microarray and provide valuable insight into anthocyanin pigment production in rice.

    Virtual Screening of Bioassay Data

    Background: There are three main problems associated with the virtual screening of bioassay data. The first is access to freely-available curated data, the second is the number of false positives that occur in the physical primary screening process, and the third is that the data are highly imbalanced, with a low ratio of Active to Inactive compounds. This paper first discusses these three problems; a selection of Weka cost-sensitive classifiers (Naive Bayes, SVM, C4.5 and Random Forest) is then applied to a variety of bioassay datasets. Results: Pharmaceutical bioassay data is not readily available to the academic community. The data held at PubChem is not curated, and there is a lack of detailed cross-referencing between Primary and Confirmatory screening assays. With regard to the number of false positives that occur in the primary screening process, the analysis carried out has been shallow due to the lack of cross-referencing mentioned above. In the six cases found, the average percentage of false positives from the High-Throughput Primary screen is quite high, at 64%. For the cost-sensitive classification, Weka's implementations of the Support Vector Machine and the C4.5 decision tree learner performed relatively well. It was also found that the setting of the Weka cost matrix depends on the base classifier used and not solely on the ratio of class imbalance. Conclusions: Understandably, pharmaceutical data is hard to obtain. However, it would be beneficial to both the pharmaceutical industry and to academics for curated primary screening and corresponding confirmatory data to be provided. Two benefits could be gained by employing virtual screening techniques on bioassay data: first, the search space of compounds to be screened could be reduced; second, by analysing the false positives that occur in the primary screening process, the technology may be improved. The number of false positives arising from primary screening raises the question of whether this type of data should be used for virtual screening at all. Care is needed when using Weka's cost-sensitive classifiers: across-the-board misclassification costs based on class ratios should not be used when comparing different classifiers on the same dataset.
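
The cost-sensitive setup can be sketched as follows, using scikit-learn's LinearSVC and its class_weight parameter in place of Weka's cost-sensitive wrappers, on synthetic data; all numbers here are illustrative. The class-ratio weighting is the across-the-board cost assignment the conclusion cautions against applying uniformly:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.utils.class_weight import compute_class_weight

rng = np.random.default_rng(1)
# Imbalanced toy stand-in for a bioassay: few Actives (1), many Inactives (0).
X = rng.normal(size=(1000, 4))
y = ((X[:, 0] > 1.6) & (X[:, 1] > 0)).astype(int)

# Misclassification costs derived from the class ratio alone ...
w = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
ratio_weighted = LinearSVC(class_weight={0: w[0], 1: w[1]})
# ... versus a cost chosen per base classifier, as the conclusions advise.
tuned = LinearSVC(class_weight={0: 1.0, 1: 5.0})

results = {}
for name, clf in [("class-ratio cost", ratio_weighted), ("tuned cost", tuned)]:
    results[name] = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
    print(name, round(results[name], 3))
```

Tuning the minority-class cost per base classifier, rather than fixing it at the imbalance ratio, is the practical upshot of the paper's finding about the Weka cost matrix.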

    Magnetotransport in an aluminum thin film on a GaAs substrate grown by molecular beam epitaxy

    Magnetotransport measurements are performed on an aluminum thin film grown on a GaAs substrate. A crossover from electron- to hole-dominant transport can be inferred from both the longitudinal resistivity and the Hall resistivity as the perpendicular magnetic field B increases. Localization effects can also be seen at low B. By analyzing the zero-field resistivity as a function of temperature T, we show the importance of surface scattering in such a nanoscale film.